{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p align=\"center\">\n",
    "    <img src=\"https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true\" width=\"220\" height=\"240\" />\n",
    "\n",
    "</p>\n",
    "\n",
    "## Data Analytics \n",
    "\n",
    "### Confidence Intervals and Hypothesis Testing in Python in Python \n",
    "\n",
    "\n",
    "#### Michael Pyrcz, Associate Professor, The University of Texas at Austin \n",
    "\n",
    "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a tutorial / demonstration of **Confidence Intervals and Hypothesis Testing in Python**.  In Python, the SciPy package, specifically the Stats functions (https://docs.scipy.org/doc/scipy/reference/stats.html) provide excellent tools for efficient use of statistics.  \n",
    "\n",
    "I have previously provided these examples worked out by-hand in Excel (https://github.com/GeostatsGuy/LectureExercises/blob/master/Lecture7_CI_Hypoth_eg_R.xlsx) and also in R (https://github.com/GeostatsGuy/LectureExercises/blob/master/Lecture7_CI_Hypoth_eg.R).  In all cases, I use the same dataset available as a comma delimited file (https://git.io/fxLAt).    \n",
    "\n",
    "This tutorial includes basic, typical confidence interval and hypothesis testing methods that would commonly be required for Engineers and Geoscientists including:\n",
    "\n",
    "1. Student-t confidence interval for the mean and proportion\n",
    "2. Student-t hypothesis test for difference in means (pooled variance)\n",
    "3. Student-t hypothesis test for difference in means (difference variances), Welch's t Test\n",
    "3. F-distribution hypothesis test for difference in variances \n",
    "\n",
    "##### Caveats\n",
    "\n",
    "I have not included all the details, specifically the test assumptions in this document.  These are included in the accompanying course notes, Lec08_hypothesis.pdf.\n",
    "\n",
    "#### Project Goal\n",
    "\n",
    "0. Introduction to Python in Jupyter including setting a working directory, loading data into a Pandas DataFrame.\n",
    "1. Learn the basics for working with confidence intervals and hypothesis testing in Python.  \n",
    "2. Demonstrate the efficiency of using Python and SciPy package for statistical analysis.\n",
    "\n",
    "#### Load the required libraries\n",
    "\n",
    "The following code loads the required libraries.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os                                                   # to set current working directory \n",
    "import numpy as np                                          # arrays and matrix math\n",
    "import scipy.stats as stats                                 # statistical methods\n",
    "import pandas as pd                                         # DataFrames\n",
    "import math                                                 # square root\n",
    "import statistics                                           # statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs.  \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Set the working directory\n",
    "\n",
    "I always like to do this so I don't lose files and to simplify subsequent read and writes (avoid including the full address each time).  Also, in this case make sure to place the required (see below) data file in this directory.  When we are done with this tutorial we will write our new dataset back to this directory.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#os.chdir(\"C:\\PGE337\")                                  # set the working directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading Data\n",
    "\n",
    "Let's load the provided dataset. 'PorositySamples2Units.csv' is available at https://github.com/GeostatsGuy/GeoDataSets. It is a comma delimited file with 20 porosity measures from 2 rock units from the subsurface, porosity (as a fraction). We load it with the pandas 'read_csv' function into a data frame we called 'df' and then preview it by printing a slice and by utilizing the 'head' DataFrame member function (with a nice and clean format, see below).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     X1    X2\n",
      "0  0.21  0.20\n",
      "1  0.17  0.26\n",
      "2  0.15  0.20\n",
      "3  0.20  0.19\n",
      "4  0.19  0.13\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X1</th>\n",
       "      <th>X2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.21</td>\n",
       "      <td>0.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.17</td>\n",
       "      <td>0.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.15</td>\n",
       "      <td>0.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.20</td>\n",
       "      <td>0.19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.19</td>\n",
       "      <td>0.13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     X1    X2\n",
       "0  0.21  0.20\n",
       "1  0.17  0.26\n",
       "2  0.15  0.20\n",
       "3  0.20  0.19\n",
       "4  0.19  0.13"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#df = pd.read_csv(\"PorositySample2Units.csv\")                # read a .csv file in as a DataFrame\n",
    "df = pd.read_csv(r\"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/PorositySample2Units.csv\") # load data from Prof. Pyrcz's github\n",
    "print(df.iloc[0:5,:])                                       # display first 4 samples in the table as a preview\n",
    "df.head()                                                   # we could also use this command for a table preview "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is useful to review the summary statistics of our loaded DataFrame.  That can be accomplished with the 'describe' DataFrame member function.  We transpose to switch the axes for ease of visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>X1</th>\n",
       "      <td>20.0</td>\n",
       "      <td>0.1645</td>\n",
       "      <td>0.027810</td>\n",
       "      <td>0.11</td>\n",
       "      <td>0.1500</td>\n",
       "      <td>0.17</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>X2</th>\n",
       "      <td>20.0</td>\n",
       "      <td>0.2000</td>\n",
       "      <td>0.045422</td>\n",
       "      <td>0.11</td>\n",
       "      <td>0.1675</td>\n",
       "      <td>0.20</td>\n",
       "      <td>0.23</td>\n",
       "      <td>0.30</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    count    mean       std   min     25%   50%   75%   max\n",
       "X1   20.0  0.1645  0.027810  0.11  0.1500  0.17  0.19  0.21\n",
       "X2   20.0  0.2000  0.045422  0.11  0.1675  0.20  0.23  0.30"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe().transpose()   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we extract the X1 and X2 unit porosity samples from the DataFrame into separate arrays called 'X1' and 'X2' for convenience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X1 = df['X1'].values\n",
    "X2 = df['X2'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Confidence Interval for the Mean\n",
    "\n",
    "Let's first demonstrate the calculation of the confidence interval for the sample mean at a 95% confidence level.  This could be interpreted as the interval over which there is a 95% confidence that it contains the true population.  We use the student's t distribution as we assume we do not know the variance and the sample size is small. \n",
    "\n",
    "\\begin{equation}\n",
    "x̅ \\pm t_{\\frac{\\alpha}{2},n-1} \\times \\frac {s}{\\sqrt{n}} \n",
    "\\end{equation}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The confidence interval for the X1 mean is [0.151 0.178]\n"
     ]
    }
   ],
   "source": [
    "ci_mean_95_x1 = stats.t.interval(0.95, len(df)-1, loc=np.mean(X1), scale=stats.sem(X1))\n",
    "print('The confidence interval for the X1 mean is ' + str(np.round(ci_mean_95_x1,3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Confidence Interval for the Mean By-Hand\n",
    "\n",
    "Let's not use the interval function and do each part by-hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Statistic +/- t-score x Standard Error\n",
      "   0.164  +/-   2.09  x     0.00622\n",
      "The confidence interval by-hand for the X1 mean is [0.151 0.178]\n"
     ]
    }
   ],
   "source": [
    "sample_mean = np.mean(X1)\n",
    "tscore = stats.t.ppf([0.025,0.975], df = len(df)-1)\n",
    "SE = statistics.stdev(X1)/math.sqrt(len(df))\n",
    "ci_mean_BH_95_x1 = sample_mean + tscore * SE\n",
    "print('Statistic +/- t-score x Standard Error')\n",
    "print('   ' + str(np.round(sample_mean,3)) + '  +/-   ' + str(np.round(tscore[1],2)) + '  x     ' + str(np.round(SE,5)))\n",
    "print('The confidence interval by-hand for the X1 mean is ' + str(np.round(ci_mean_BH_95_x1,3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Confidence Interval for the Proportion\n",
    "\n",
    "Now I demonstrate the calculation of the confidence interval for a proportion.\n",
    "\n",
    "\\begin{equation}\n",
    "\\hat{p} \\pm t_{\\frac{\\alpha}{2},n-1} \\times \\frac {\\hat{p}(1-\\hat{p})}{\\sqrt{n}} \n",
    "\\end{equation}\n",
    "\n",
    "First we need to make a new categorical feature to demonstrate confidence intervals for proportions.\n",
    "\n",
    "* we will assign samples as category 'high porosity' when their porosity is greater than 18%"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New binary feature, 1 = high porosity, 0 = low porosity\n",
      "[1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0]\n",
      "\n",
      "Proportion of high porosity rock in well X1 is 0.35\n"
     ]
    }
   ],
   "source": [
    "CX1 = np.where(X1 < 0.18,0,1)\n",
    "prop_CX1 = np.sum(CX1 == 1)/CX1.shape[0]\n",
    "print('New binary feature, 1 = high porosity, 0 = low porosity')\n",
    "print(CX1)\n",
    "\n",
    "print('\\nProportion of high porosity rock in well X1 is ' + str(np.round(prop_CX1,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to calculate the confidence interval in the proportion with the SciPy stats t interval function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The confidence interval for the X1 proportion of high porosity is [0.244 0.456]\n"
     ]
    }
   ],
   "source": [
    "SE_prop = (prop_CX1*(1-prop_CX1))/math.sqrt(CX1.shape[0])\n",
    "ci_prop_95_x1 = stats.t.interval(0.95, CX1.shape[0]-1, loc=prop_CX1, scale=SE_prop)\n",
    "print('The confidence interval for the X1 proportion of high porosity is ' + str(np.round(ci_prop_95_x1,3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Confidence Interval for the Proportion By-Hand\n",
    "\n",
    "Let's repeat this calculation by-hand without the SciPy stats function to test our knowledge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Statistic +/- t-score x Standard Error\n",
      "   0.35   +/-   2.09  x     0.0509\n",
      "The confidence interval for the X1 proportion of high porosity is [0.244 0.456]\n"
     ]
    }
   ],
   "source": [
    "sample_prop = np.sum(CX1 == 1)/CX1.shape[0]\n",
    "tscore = stats.t.ppf([0.025,0.975], df = len(df)-1)\n",
    "SE_prop = (prop_CX1*(1-prop_CX1))/math.sqrt(CX1.shape[0])\n",
    "ci_prop_BH_95_x1 = sample_prop + tscore * SE_prop\n",
    "print('Statistic +/- t-score x Standard Error')\n",
    "print('   ' + str(np.round(sample_prop,3)) + '   +/-   ' + str(np.round(tscore[1],2)) + '  x     ' + str(np.round(SE_prop,4)))\n",
    "print('The confidence interval for the X1 proportion of high porosity is ' + str(np.round(ci_prop_BH_95_x1,3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One can check the Excel file linked above with the confidence interval calculated by hand and confirm that this result is correct.\n",
    "\n",
    "##### Hypothesis Testing\n",
    "\n",
    "Now, let's try the t test, hypothesis test for difference in means. This test assumes that the variances are similar along with the data being Gaussian distributed (see the course notes for more on this).  This is our test:\n",
    "\n",
    "\\begin{equation}\n",
    "H_0: \\mu_{X1} = \\mu_{X2}\n",
    "\\end{equation}\n",
    "\n",
    "\\begin{equation}\n",
    "H_1: \\mu_{X1} \\ne \\mu_{X2}\n",
    "\\end{equation}\n",
    "\n",
    "For the resulting t-statistic and p-value we run this command.\n",
    "\n",
    "##### Pooled Variance t-test Difference in Means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The t statistic is -2.98 and the p-value is 0.00499\n"
     ]
    }
   ],
   "source": [
    "t_pooled, p_pooled = stats.ttest_ind(X1,X2) # assuminng equal variance\n",
    "print('The t statistic is ' + str(np.round(t_pooled,2)) + ' and the p-value is ' + str(np.round(p_pooled,5)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The p-value, $p$, is the symmetric interval probaiblity our outside.  In other words the $p$ reported is 2 x cumulative probability of the t statistic applied to the sampling t distribution.  Another way to look at it, if one used the $\\pm t_{t_{statistic},.d.f}$ statistic as thresholds, $p$ is the probability being outside this symmetric interval. So we will reject the null hypothesis if $p \\lt \\alpha$.  From the p-value alone it is clear that we would reject the null hypothesis and accept the alternative hypothesis that the means are not equal.  \n",
    "\n",
    "In case you want to compare the t-statistic to t-critical, we can apply the inverse of the student's t distribution at $\\frac{\\alpha}{2}$ and $1-\\frac{\\alpha}{2}$ to get the upper and lower critcal values.       "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The t crical lower and upper values are [-2.02  2.02]\n"
     ]
    }
   ],
   "source": [
    "t_critical = stats.t.ppf([0.025,0.975], df=len(X1)+len(X2)-2)\n",
    "print('The t crical lower and upper values are ' + str(np.round(t_critical,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe that, as expected, the t-statistic is outside the t-critcal interval.  These results are exactly what we got when we worked out the problem by hand in Excel, but so much more efficient!  \n",
    "\n",
    "##### Welch's t-test Difference in Means \n",
    "\n",
    "Now let's try the t-test, hypothesis test for difference in means allowing for unequal variances, this is also known as the Welch's t test.  All we have to do is set the parameter 'equal_var' to false, note it defaults to true (e.g. the command above). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=-2.9808897468855644, pvalue=0.005502572350112333)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.ttest_ind(X1, X2, equal_var = False) # allowing for difference in variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again we can see by $p$ that we will clearly reject the null hypothesis.  \n",
    "\n",
    "##### F-test Difference in Variances\n",
    "\n",
    "Let's now compare the variances with the F-test for difference in variances.  \n",
    "\n",
    "\\begin{equation}\n",
    "H_0: \\frac{\\sigma^{2}_{X_2}}{\\sigma^{2}_{X_1}} = 1.0\n",
    "\\end{equation}\n",
    "\n",
    "\\begin{equation}\n",
    "H_1: \\frac{\\sigma^{2}_{X_2}}{\\sigma^{2}_{X_1}} > 1.0\n",
    "\\end{equation}\n",
    "\n",
    "Note, by ordering the variances we eliminate the case of $\\sigma^{2}_{X_2} \\lt \\sigma^{2}_{X_1}$.\n",
    "\n",
    "Details about the test are available in the course notes (along with assumptions such as Gaussian distributed) and this example is also worked out by hand in the linked Excel workbook.  We can accomplish the F-test in with SciPy.Stats the function with one line of code if we calculate the ratio of the sample variances ensuring that the larger variance is in the numerator and get the degrees of freedom using the len() command, ensuring that we are consistent with the numerator degrees of freedom set as 'dfn' and the denominator degrees of freedom set as 'dfd'.  We take a p-value of $1-p$ since the test is configured to be a single, right tailed test.    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.01918734806315381"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p_value = 1 - stats.f.cdf(np.var(X2)/np.var(X1), dfn=len(X2)-1, dfd=len(X1)-1)\n",
    "p_value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again we would clearly reject the null hypothesis since $p \\lt alpha$ and assume that the variances are not equal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compact Code Solution - Hypothesis Test, Student's t-test for Difference in Means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Equal Variance: t-critical and t-statistic are -2.02 ≤ -2.98 ≤ 2.02; therefore, reject the null hypothesis\n",
      "Welch's test: t-critical and t-statistic are -2.02 ≤ -2.98 ≤ 2.02; therefore, reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats\n",
    "\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/PorositySample2Units.csv')\n",
    "X1 = df['X1'].values; X2 = df['X2'].values\n",
    "\n",
    "alpha = 0.05\n",
    "t_critical = stats.t.ppf(alpha/2,len(X1)+len(X2)-2)\n",
    "\n",
    "t_pooled, p_pooled = stats.ttest_ind(X1,X2) # assuminng equal variance\n",
    "if(t_pooled < t_critical or t_pooled > -1*t_critical):\n",
    "    print('Equal Variance: t-critical and t-statistic are ' + str(np.round(t_critical,2)) + ' ≤ ' + str(np.round(t_pooled,2)) + \n",
    "      ' ≤ ' + str(np.round(-1*t_critical,2)) + '; therefore, reject the null hypothesis')\n",
    "else:\n",
    "    print('Equal Variance: t-critical and t-statistic are ' + str(np.round(t_critical,2)) + ' ≤ ' + str(np.round(t_pooled,2)) + \n",
    "      ' ≤ ' + str(np.round(-1*t_critical,2)) + '; therefore, fail to reject the null hypothesis')\n",
    "    \n",
    "    alpha = 0.05\n",
    "t_critical = stats.t.ppf(alpha/2,len(X1)+len(X2)-2)\n",
    "\n",
    "t_welch, p_welch = stats.ttest_ind(X1,X2,equal_var = False) # assuming unequal variance, Welch's t-test\n",
    "if(t_welch < t_critical or welch > -1*t_critical):\n",
    "    print('Welch\\'s test: t-critical and t-statistic are ' + str(np.round(t_critical,2)) + ' ≤ ' + str(np.round(t_welch,2)) + \n",
    "      ' ≤ ' + str(np.round(-1*t_critical,2)) + '; therefore, reject the null hypothesis')\n",
    "else:\n",
    "    print('Welch\\'s test: t-critical and t-statistic are ' + str(np.round(t_critical,2)) + ' ≤ ' + str(np.round(t_welch,2)) + \n",
    "      ' ≤ ' + str(np.round(-1*t_critical,2)) + '; therefore, fail to reject the null hypothesis')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compact Code Solution - Hypothesis Test, F-test for Difference in Variances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "f-statistic and f-critical are 2.67 > 2.17; therefore, reject the null hypothesis\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats\n",
    "\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/PorositySample2Units.csv')\n",
    "X1 = df['X1'].values; X2 = df['X2'].values\n",
    "\n",
    "alpha = 0.05\n",
    "\n",
    "var_X1 = np.var(X1,ddof=1); var_X2 = np.var(X2,ddof=1) # sample variance\n",
    "f_stat = np.max([var_X1,var_X2])/np.min([var_X1,var_X2])\n",
    "if var_X1 > var_X2:\n",
    "    f_critical = stats.f.ppf(1-alpha,len(X1)-1,len(X2)-1) \n",
    "else:\n",
    "    f_critical = stats.f.ppf(1-alpha,len(X2)-1,len(X2)-1) \n",
    "if f_stat > f_critical:\n",
    "    print('f-statistic and f-critical are ' + str(np.round(f_stat,2)) + ' > ' + str(np.round(f_critical,2)) \n",
    "          + '; therefore, reject the null hypothesis')\n",
    "else:\n",
    "    print('f-statistic and f-critical are ' + str(np.round(f_stat,2)) + ' ≤ ' + str(np.round(f_critical,2)) \n",
    "          + '; therefore, fail to reject the null hypothesis')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Comments\n",
    "\n",
    "We are just scratching the surface for confidence intervals and hypothesis tests.  Once again there are a lot of details left out of the problem formulation and assumptions, see the course notes for more coverage.  By running the same confidence interval and hypothesis tests 1) by hand in Excel and with 2) R and 3) Python code, I hope this demonstration will enable and encourage more engineers and scientists to make these R and Python tools part of their common practice. I'm always happy to discuss,\n",
    "\n",
    "*Michael*\n",
    "\n",
    "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
    "On twitter I'm the @GeostatsGuy.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
